By: Jacob Menzinga (357758)
In the United States, crime rates rose steadily until the 1990s, then fell sharply. Researchers have proposed multiple explanations for this decline. One factor that drew attention was lead exposure, which led to the lead-crime hypothesis. Many papers have since been published on the topic. One example is a meta-analysis published by Anthony Higney et al. (2022), in which they write: "Our estimates suggest the abatement of lead pollution may be responsible for 7–28% of the fall in homicide in the US"
For the final assessment of Programming 1 I decided to investigate whether such a relation is also visible in the Netherlands. Taking blood-lead values of the whole population of the Netherlands would be both unethical and expensive, so I decided to look at the amount of lead in wastewater instead.
A higher amount of lead in wastewater correlates with a higher incidence of violent crimes
Wastewater treatment in the Netherlands
From this dataset the following features were selected:
Onderwerpen -> Aanvoer van afvalwater -> Hoeveelheden:
Regios:
Perioden:
Registered crime in the Netherlands
From this dataset the following features were selected:
Onderwerpen -> Geregistreerde misdrijven:
Soort misdrijf:
Regios:
Perioden:
# Imports
import yaml
import pandas as pd
import numpy as np
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
output_notebook()
import hvplot.pandas
import holoviews as hv
hv.extension('bokeh')
import folium
from folium import plugins
import DS_Module as ds
def check_data(df):
    """
    A function to check any dataframe for:
    - Missing data
    - Datatypes
    - Descriptive statistics
    It then prints its findings.

    Args:
        df (pd.DataFrame): Any dataframe.
    """
    missing_data = df.isna().sum()
    if missing_data.values.sum() == 0:
        print('Missing data:')
        print('No missing data :)')
        missing_loc = 'None'
    else:
        missing_loc = df[df.isnull().any(axis=1)]
        print(f"Missing data per column:\n{missing_data}\n")
        print("The missing data is located in the following rows:")
        print(missing_loc)
    dtypes = df.dtypes
    print('\nData types:')
    print(dtypes)
    describe = df.describe()
    print('\nDescription of the dataframe')
    print(describe)
with open('config.yaml') as stream:
    config = yaml.safe_load(stream)
crime_df = pd.read_csv(config['crime'], delimiter=';')
lead_df = pd.read_csv(config['lead'], delimiter=';')
First I'll have a look at crime_df
crime_df.rename(columns= {'SoortMisdrijf':'Crime',
'RegioS':'Province', 'Perioden':'Year',
'GeregistreerdeMisdrijvenPer1000Inw_3':'Incidence'},
inplace=True)
crime_df
| | ID | Crime | Province | Year | Incidence |
|---|---|---|---|---|---|
| 0 | 17712 | CRI1110 | PV20 | 2010JJ00 | 1.0 |
| 1 | 17713 | CRI1110 | PV20 | 2011JJ00 | 0.7 |
| 2 | 17714 | CRI1110 | PV20 | 2012JJ00 | 0.6 |
| 3 | 17715 | CRI1110 | PV20 | 2013JJ00 | 0.5 |
| 4 | 17716 | CRI1110 | PV20 | 2014JJ00 | 0.4 |
| ... | ... | ... | ... | ... | ... |
| 996 | 417702 | CRI7000 | PV99 | 2016JJ00 | NaN |
| 997 | 417703 | CRI7000 | PV99 | 2017JJ00 | NaN |
| 998 | 417704 | CRI7000 | PV99 | 2018JJ00 | NaN |
| 999 | 417705 | CRI7000 | PV99 | 2019JJ00 | NaN |
| 1000 | 417706 | CRI7000 | PV99 | 2020JJ00 | NaN |
1001 rows × 5 columns
check_data(crime_df)
Missing data per column:
ID 0
Crime 0
Province 0
Year 0
Incidence 60
dtype: int64
The missing data is located in the following rows:
ID Crime Province Year Incidence
132 17856 CRI1110 PV99 2010JJ00 NaN
133 17857 CRI1110 PV99 2011JJ00 NaN
134 17858 CRI1110 PV99 2012JJ00 NaN
135 17859 CRI1110 PV99 2013JJ00 NaN
136 17860 CRI1110 PV99 2014JJ00 NaN
137 17861 CRI1110 PV99 2015JJ00 NaN
138 17862 CRI1110 PV99 2016JJ00 NaN
140 17864 CRI1110 PV99 2018JJ00 NaN
141 17865 CRI1110 PV99 2019JJ00 NaN
142 17866 CRI1110 PV99 2020JJ00 NaN
275 82536 CRI1500 PV99 2010JJ00 NaN
277 82538 CRI1500 PV99 2012JJ00 NaN
278 82539 CRI1500 PV99 2013JJ00 NaN
280 82541 CRI1500 PV99 2015JJ00 NaN
281 82542 CRI1500 PV99 2016JJ00 NaN
282 82543 CRI1500 PV99 2017JJ00 NaN
285 82546 CRI1500 PV99 2020JJ00 NaN
418 111936 CRI2100 PV99 2010JJ00 NaN
419 111937 CRI2100 PV99 2011JJ00 NaN
420 111938 CRI2100 PV99 2012JJ00 NaN
421 111939 CRI2100 PV99 2013JJ00 NaN
422 111940 CRI2100 PV99 2014JJ00 NaN
423 111941 CRI2100 PV99 2015JJ00 NaN
424 111942 CRI2100 PV99 2016JJ00 NaN
425 111943 CRI2100 PV99 2017JJ00 NaN
426 111944 CRI2100 PV99 2018JJ00 NaN
427 111945 CRI2100 PV99 2019JJ00 NaN
428 111946 CRI2100 PV99 2020JJ00 NaN
561 147216 CRI2210 PV99 2010JJ00 NaN
562 147217 CRI2210 PV99 2011JJ00 NaN
564 147219 CRI2210 PV99 2013JJ00 NaN
567 147222 CRI2210 PV99 2016JJ00 NaN
568 147223 CRI2210 PV99 2017JJ00 NaN
569 147224 CRI2210 PV99 2018JJ00 NaN
571 147226 CRI2210 PV99 2020JJ00 NaN
705 194257 CRI2300 PV99 2011JJ00 NaN
706 194258 CRI2300 PV99 2012JJ00 NaN
712 194264 CRI2300 PV99 2018JJ00 NaN
847 235416 CRI3000 PV99 2010JJ00 NaN
848 235417 CRI3000 PV99 2011JJ00 NaN
849 235418 CRI3000 PV99 2012JJ00 NaN
850 235419 CRI3000 PV99 2013JJ00 NaN
851 235420 CRI3000 PV99 2014JJ00 NaN
852 235421 CRI3000 PV99 2015JJ00 NaN
853 235422 CRI3000 PV99 2016JJ00 NaN
854 235423 CRI3000 PV99 2017JJ00 NaN
855 235424 CRI3000 PV99 2018JJ00 NaN
856 235425 CRI3000 PV99 2019JJ00 NaN
857 235426 CRI3000 PV99 2020JJ00 NaN
990 417696 CRI7000 PV99 2010JJ00 NaN
991 417697 CRI7000 PV99 2011JJ00 NaN
992 417698 CRI7000 PV99 2012JJ00 NaN
993 417699 CRI7000 PV99 2013JJ00 NaN
994 417700 CRI7000 PV99 2014JJ00 NaN
995 417701 CRI7000 PV99 2015JJ00 NaN
996 417702 CRI7000 PV99 2016JJ00 NaN
997 417703 CRI7000 PV99 2017JJ00 NaN
998 417704 CRI7000 PV99 2018JJ00 NaN
999 417705 CRI7000 PV99 2019JJ00 NaN
1000 417706 CRI7000 PV99 2020JJ00 NaN
Data types:
ID int64
Crime object
Province object
Year object
Incidence object
dtype: object
Description of the dataframe
ID
count 1001.000000
mean 172349.000000
std 120100.686889
min 17712.000000
25% 82508.000000
50% 147149.000000
75% 235310.000000
max 417706.000000
Two things I took away from the data check: 1) There are a lot of missing values in the PV99 region. I looked this region code up in the metadata file of the crime dataset (also downloadable from the link above); it is a category for 'uncategorisable' data, so I will drop these rows.
2) I want to convert the Year and Incidence columns to int and float dtypes respectively
# Dropping the PV99 region
crime_df = crime_df[crime_df['Province'] != 'PV99 ']
crime_df
# This seems to have messed up the index a bit,
# considering there are 924 rows in the df but the index goes up to 989.
| | ID | Crime | Province | Year | Incidence |
|---|---|---|---|---|---|
| 0 | 17712 | CRI1110 | PV20 | 2010JJ00 | 1.0 |
| 1 | 17713 | CRI1110 | PV20 | 2011JJ00 | 0.7 |
| 2 | 17714 | CRI1110 | PV20 | 2012JJ00 | 0.6 |
| 3 | 17715 | CRI1110 | PV20 | 2013JJ00 | 0.5 |
| 4 | 17716 | CRI1110 | PV20 | 2014JJ00 | 0.4 |
| ... | ... | ... | ... | ... | ... |
| 985 | 417690 | CRI7000 | PV31 | 2016JJ00 | 0.3 |
| 986 | 417691 | CRI7000 | PV31 | 2017JJ00 | 0.3 |
| 987 | 417692 | CRI7000 | PV31 | 2018JJ00 | 0.4 |
| 988 | 417693 | CRI7000 | PV31 | 2019JJ00 | 0.5 |
| 989 | 417694 | CRI7000 | PV31 | 2020JJ00 | 0.5 |
924 rows × 5 columns
crime_df = crime_df.reset_index(drop=True)
# Checking the values in the Incidence and Year columns
print(crime_df['Incidence'].unique())
print(crime_df['Year'].unique())
[' 1.0' ' 0.7' ' 0.6' ' 0.5' ' 0.4' ' 0.3' ' 0.2' ' .' ' 1.1' ' 1.2' ' 0.9' ' 0.8' ' 1.6' ' 1.5' ' 1.3' ' 0.1' ' 0.0' ' 9.5' ' 9.1' ' 8.5' ' 7.4' ' 6.6' ' 5.9' ' 5.7' ' 4.7' ' 5.1' ' 4.9' ' 8.8' ' 7.8' ' 6.7' ' 6.1' ' 5.6' ' 4.4' ' 4.1' ' 4.3' ' 4.6' ' 8.6' ' 8.4' ' 7.0' ' 5.5' ' 3.7' ' 4.0' ' 7.9' ' 7.6' ' 6.5' ' 10.6' ' 10.3' ' 8.2' ' 7.1' ' 5.8' ' 5.0' ' 4.5' ' 4.8' ' 8.9' ' 8.1' ' 7.3' ' 6.3' ' 5.3' ' 4.2' ' 3.6' ' 8.0' ' 9.0' ' 9.8' ' 7.5' ' 6.8' ' 10.2' ' 6.4' ' 6.9' ' 3.8' ' 9.3' ' 10.1' ' 7.7' ' 7.2' ' 6.2' ' 5.2' ' 5.4' ' 6.0' ' 3.4'] ['2010JJ00' '2011JJ00' '2012JJ00' '2013JJ00' '2014JJ00' '2015JJ00' '2016JJ00' '2017JJ00' '2018JJ00' '2019JJ00' '2020JJ00']
I'm assuming the ' .' values for incidence should be 0. To check this, I want to plot the 5 datapoints before each of these values to see if they show a trend decreasing toward 0.
missing_crime = crime_df[crime_df['Incidence']==' .'].index
missing_crime = np.array(missing_crime)
trend_ind = np.empty(0, dtype=int)
for i in range(6):
    trend_ind = np.append(trend_ind, missing_crime - i)
trend_ind = np.unique(trend_ind)
# I will use this list after replacing the ' .' values with 0,
# to check if my assumption was correct
# replacing the ' .' value with 0
crime_df['Incidence'] = crime_df['Incidence'].str.replace(' .', '0', regex=False)
# Typecasting the Year and Incidence columns
crime_df['Year'] = crime_df['Year'].str.replace('JJ00','').astype(int)
crime_df['Incidence'] = crime_df['Incidence'].astype(float)
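A side note (a sketch, not part of this notebook's pipeline): `pd.to_numeric` with `errors='coerce'` can do the same parsing in one step, turning the ' .' placeholder into an explicit NaN before deciding how to fill it:

```python
import pandas as pd

# Toy series with the same ' .' placeholder CBS uses for missing values.
s = pd.Series([' 1.0', ' 0.7', ' .', ' 0.4'])

# Anything unparseable becomes NaN, so the missing values stay visible
# until we explicitly decide to fill them with 0.
parsed = pd.to_numeric(s, errors='coerce')
filled = parsed.fillna(0.0)
```

This keeps the "missing" state explicit for one extra step, which makes assumptions like "missing means zero" easier to audit later.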
# Plotting the incidence, per region and crime. 0 values are interpolated.
hv.output(widget_location='top_left')
crime_df.iloc[trend_ind].sort_values(by='Year').hvplot.line(
    x='Year', y='Incidence', groupby=['Crime', 'Province'])
# Now that that's done, I'll run check_data again to see if I got rid of all the missing data
check_data(crime_df)
Missing data:
No missing data :)
Data types:
ID int64
Crime object
Province object
Year int32
Incidence float64
dtype: object
Description of the dataframe
ID Year Incidence
count 924.000000 924.00000 924.000000
mean 172343.000000 2015.00000 1.885498
std 120105.690147 3.16399 2.661858
min 17712.000000 2010.00000 0.000000
25% 82499.500000 2012.00000 0.200000
50% 147143.000000 2015.00000 0.400000
75% 235306.500000 2018.00000 4.125000
max 417694.000000 2020.00000 10.600000
The crime_df has the different types of crime in one column; I would like each crime as a separate feature, with the incidence as its value
crime_df = crime_df.set_index(['Province','Year']).pivot(columns='Crime', values='Incidence').reset_index()
crime_df
| Crime | Province | Year | CRI1110 | CRI1500 | CRI2100 | CRI2210 | CRI2300 | CRI3000 | CRI7000 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | PV20 | 2010 | 1.0 | 0.0 | 9.5 | 0.5 | 0.6 | 8.6 | 0.3 |
| 1 | PV20 | 2011 | 0.7 | 0.0 | 9.1 | 0.4 | 0.6 | 8.0 | 0.4 |
| 2 | PV20 | 2012 | 0.6 | 0.0 | 8.5 | 0.5 | 0.5 | 7.7 | 0.3 |
| 3 | PV20 | 2013 | 0.5 | 0.0 | 7.4 | 0.3 | 0.3 | 7.2 | 0.3 |
| 4 | PV20 | 2014 | 0.4 | 0.0 | 6.6 | 0.2 | 0.3 | 6.6 | 0.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 127 | PV31 | 2016 | 0.4 | 0.1 | 5.6 | 0.2 | 0.4 | 5.0 | 0.3 |
| 128 | PV31 | 2017 | 0.3 | 0.0 | 4.9 | 0.2 | 0.4 | 4.7 | 0.3 |
| 129 | PV31 | 2018 | 0.3 | 0.0 | 4.3 | 0.2 | 0.3 | 4.6 | 0.4 |
| 130 | PV31 | 2019 | 0.4 | 0.1 | 4.8 | 0.2 | 0.3 | 4.6 | 0.5 |
| 131 | PV31 | 2020 | 0.3 | 0.1 | 4.6 | 0.1 | 0.3 | 4.1 | 0.5 |
132 rows × 9 columns
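The reshape above can be illustrated on a toy frame (hypothetical values, same long-to-wide pattern as crime_df):

```python
import pandas as pd

# Long format: one row per (province, year, crime) combination.
long_df = pd.DataFrame({
    'Province':  ['PV20', 'PV20', 'PV21', 'PV21'],
    'Year':      [2010, 2010, 2010, 2010],
    'Crime':     ['CRI1110', 'CRI2100', 'CRI1110', 'CRI2100'],
    'Incidence': [1.0, 9.5, 0.8, 7.2],
})

# Wide format: one row per (province, year), one column per crime code.
wide_df = (long_df.set_index(['Province', 'Year'])
                  .pivot(columns='Crime', values='Incidence')
                  .reset_index())
```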
A total incidence would also be useful
crime_df['Total_incidence'] = crime_df[[
'CRI1110', 'CRI1500', 'CRI2100', 'CRI2210', 'CRI2300', 'CRI3000', 'CRI7000']
].sum(axis=1)
crime_df
| Crime | Province | Year | CRI1110 | CRI1500 | CRI2100 | CRI2210 | CRI2300 | CRI3000 | CRI7000 | Total_incidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PV20 | 2010 | 1.0 | 0.0 | 9.5 | 0.5 | 0.6 | 8.6 | 0.3 | 20.5 |
| 1 | PV20 | 2011 | 0.7 | 0.0 | 9.1 | 0.4 | 0.6 | 8.0 | 0.4 | 19.2 |
| 2 | PV20 | 2012 | 0.6 | 0.0 | 8.5 | 0.5 | 0.5 | 7.7 | 0.3 | 18.1 |
| 3 | PV20 | 2013 | 0.5 | 0.0 | 7.4 | 0.3 | 0.3 | 7.2 | 0.3 | 16.0 |
| 4 | PV20 | 2014 | 0.4 | 0.0 | 6.6 | 0.2 | 0.3 | 6.6 | 0.2 | 14.3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 127 | PV31 | 2016 | 0.4 | 0.1 | 5.6 | 0.2 | 0.4 | 5.0 | 0.3 | 12.0 |
| 128 | PV31 | 2017 | 0.3 | 0.0 | 4.9 | 0.2 | 0.4 | 4.7 | 0.3 | 10.8 |
| 129 | PV31 | 2018 | 0.3 | 0.0 | 4.3 | 0.2 | 0.3 | 4.6 | 0.4 | 10.1 |
| 130 | PV31 | 2019 | 0.4 | 0.1 | 4.8 | 0.2 | 0.3 | 4.6 | 0.5 | 10.9 |
| 131 | PV31 | 2020 | 0.3 | 0.1 | 4.6 | 0.1 | 0.3 | 4.1 | 0.5 | 10.0 |
132 rows × 10 columns
Now it's time for lead_df
lead_df.rename(columns={'RegioS':'Province', 'Perioden':'Year',
'VolumeAfvalwater_43':'Vol_Wastewater',
'Lood_52':'Lead'}, inplace= True)
lead_df
| | ID | Province | Year | Vol_Wastewater | Lead |
|---|---|---|---|---|---|
| 0 | 187 | PV20 | 2010JJ00 | 74963 | 1619 |
| 1 | 188 | PV20 | 2011JJ00 | 72217 | 1202 |
| 2 | 189 | PV20 | 2012JJ00 | 72487 | 1485 |
| 3 | 190 | PV20 | 2013JJ00 | 67297 | 1057 |
| 4 | 191 | PV20 | 2014JJ00 | 64758 | 931 |
| ... | ... | ... | ... | ... | ... |
| 127 | 556 | PV31 | 2016JJ00 | 158787 | 3327 |
| 128 | 557 | PV31 | 2017JJ00 | 148311 | . |
| 129 | 558 | PV31 | 2018JJ00 | 131743 | 4911 |
| 130 | 559 | PV31 | 2019JJ00 | 143745 | . |
| 131 | 560 | PV31 | 2020JJ00 | 148116 | 1754 |
132 rows × 5 columns
# First some general information
check_data(lead_df)
Missing data:
No missing data :)
Data types:
ID int64
Province object
Year object
Vol_Wastewater int64
Lead object
dtype: object
Description of the dataframe
ID Vol_Wastewater
count 132.000000 132.000000
mean 373.500000 158719.393939
std 114.395757 117111.886323
min 187.000000 26873.000000
25% 280.250000 64524.000000
50% 373.500000 118650.500000
75% 466.750000 237690.000000
max 560.000000 430470.000000
In this dataframe I want to change the Year and Lead columns to integers
print(lead_df['Lead'].unique())
[' 1619' ' 1202' ' 1485' ' 1057' ' 931' ' 1629' ' 1888' ' .' ' 1299' ' 838' ' 1314' ' 1671' ' 1432' ' 1354' ' 1250' ' 1268' ' 1130' ' 1197' ' 828' ' 810' ' 1077' ' 964' ' 1024' ' 1089' ' 1105' ' 900' ' 918' ' 2044' ' 1958' ' 1730' ' 1842' ' 2588' ' 3006' ' 2689' ' 2355' ' 2440' ' 154' ' 129' ' 345' ' 592' ' 517' ' 204' ' 375' ' 245' ' 183' ' 4149' ' 3962' ' 4203' ' 4094' ' 5572' ' 2938' ' 2791' ' 2943' ' 3168' ' 2258' ' 2625' ' 1971' ' 2289' ' 1554' ' 1937' ' 1909' ' 1756' ' 1668' ' 5043' ' 5776' ' 4752' ' 7349' ' 5200' ' 5666' ' 5091' ' 5283' ' 5318' ' 9377' ' 10010' ' 8159' ' 7209' ' 8643' ' 7361' ' 7759' ' 9075' ' 6379' ' 1283' ' 667' ' 1255' ' 823' ' 1045' ' 825' ' 744' ' 668' ' 657' ' 5688' ' 7995' ' 6981' ' 4731' ' 5345' ' 4684' ' 7248' ' 5748' ' 3751' ' 2821' ' 3635' ' 2676' ' 3970' ' 3025' ' 5182' ' 3327' ' 4911' ' 1754']
In this dataset there are also ' .' values. A value of 0 kg of lead in the wastewater does not make sense to me, given that all the other values are above 100 and Vol_Wastewater is not decreased in these rows.
I'm going to interpolate these values and plot them between non-interpolated values to see if this makes sense
missing_lead = lead_df[lead_df['Lead']==' .'].index
missing_lead = np.array(missing_lead)
lead_plus_min = np.append(missing_lead-1, missing_lead+1)
lead_plus_min
array([ 6, 8, 17, 19, 28, 30, 39, 41, 50, 52, 61, 63, 72,
74, 83, 85, 94, 96, 105, 107, 116, 118, 127, 129, 8, 10,
19, 21, 30, 32, 41, 43, 52, 54, 63, 65, 74, 76, 85,
87, 96, 98, 107, 109, 118, 120, 129, 131], dtype=int64)
# replacing the ' .' value with NaN
lead_df['Lead'] = lead_df['Lead'].replace(' .', np.nan, regex=False)
# Typecasting the Year and Lead columns
lead_df['Year'] = lead_df['Year'].str.replace('JJ00','').astype(int)
lead_df['Lead'] = lead_df['Lead'].astype(float) # Float for now because NaN can't be int.
# Filling the NaN with an interpolated value
lead_df['Lead'] = lead_df['Lead'].interpolate().astype(int)
# Now to check if the interpolated data make sense
measured_lead_scatter = lead_df.iloc[lead_plus_min].hvplot.scatter(
    x='Vol_Wastewater', y='Lead', label='Experimental data')
interpolated_lead_scatter = lead_df.iloc[missing_lead].hvplot.scatter(
    x='Vol_Wastewater', y='Lead', color='red', label='Interpolated data')
factcheck = measured_lead_scatter * interpolated_lead_scatter
factcheck.opts(title='Measured and interpolated lead data',
               ylabel='Lead (kg)', xlabel='Volume Influent Wastewater (m3)',
               legend_position='top_left')
Now that we have the amount of wastewater (in units of 1000 m3) and the amount of lead in the water (in kg), I would like to create a column with the amount of lead per m3 of water
lead_df['lead_per_m3'] = lead_df['Lead'] / lead_df['Vol_Wastewater']
# Converting lead from kilogram to gram
lead_df['lead_per_m3'] = lead_df['lead_per_m3']*1000
lead_df
| | ID | Province | Year | Vol_Wastewater | Lead | lead_per_m3 |
|---|---|---|---|---|---|---|
| 0 | 187 | PV20 | 2010 | 74963 | 1619 | 21.597321 |
| 1 | 188 | PV20 | 2011 | 72217 | 1202 | 16.644280 |
| 2 | 189 | PV20 | 2012 | 72487 | 1485 | 20.486432 |
| 3 | 190 | PV20 | 2013 | 67297 | 1057 | 15.706495 |
| 4 | 191 | PV20 | 2014 | 64758 | 931 | 14.376602 |
| ... | ... | ... | ... | ... | ... | ... |
| 127 | 556 | PV31 | 2016 | 158787 | 3327 | 20.952597 |
| 128 | 557 | PV31 | 2017 | 148311 | 4119 | 27.772721 |
| 129 | 558 | PV31 | 2018 | 131743 | 4911 | 37.277123 |
| 130 | 559 | PV31 | 2019 | 143745 | 3332 | 23.179937 |
| 131 | 560 | PV31 | 2020 | 148116 | 1754 | 11.842070 |
132 rows × 6 columns
# Let's check the data again now that we're done cleaning and interpolating values.
check_data(lead_df)
Missing data:
No missing data :)
Data types:
ID int64
Province object
Year int32
Vol_Wastewater int64
Lead int32
lead_per_m3 float64
dtype: object
Description of the dataframe
ID Year Vol_Wastewater Lead lead_per_m3
count 132.000000 132.000000 132.000000 132.000000 132.000000
mean 373.500000 2015.000000 158719.393939 2966.060606 17.756022
std 114.395757 3.174324 117111.886323 2414.776220 5.011926
min 187.000000 2010.000000 26873.000000 129.000000 4.798750
25% 280.250000 2012.000000 64524.000000 1086.000000 14.341746
50% 373.500000 2015.000000 118650.500000 1964.500000 17.531057
75% 466.750000 2018.000000 237690.000000 4735.500000 20.844686
max 560.000000 2020.000000 430470.000000 10010.000000 37.277123
Now I'm going to have a look at the Province column in both DataFrames, since this is the feature I'll be merging on
print(f"""
Crime regions:
{crime_df['Province'].unique()}
Lead regions:
{lead_df['Province'].unique()}""")
Crime regions: ['PV20 ' 'PV21 ' 'PV22 ' 'PV23 ' 'PV24 ' 'PV25 ' 'PV26 ' 'PV27 ' 'PV28 ' 'PV29 ' 'PV30 ' 'PV31 '] Lead regions: ['PV20 ' 'PV21 ' 'PV22 ' 'PV23 ' 'PV24 ' 'PV25 ' 'PV26 ' 'PV27 ' 'PV28 ' 'PV29 ' 'PV30 ' 'PV31 ']
There clearly is some whitespace that needs removing.
crime_df['Province'] = crime_df['Province'].str.strip()
lead_df['Province'] = lead_df['Province'].str.strip()
print(f"""
Crime regions:
{crime_df['Province'].unique()}
Lead regions:
{lead_df['Province'].unique()}""")
Crime regions: ['PV20' 'PV21' 'PV22' 'PV23' 'PV24' 'PV25' 'PV26' 'PV27' 'PV28' 'PV29' 'PV30' 'PV31'] Lead regions: ['PV20' 'PV21' 'PV22' 'PV23' 'PV24' 'PV25' 'PV26' 'PV27' 'PV28' 'PV29' 'PV30' 'PV31']
Now both dataframes are ready to be merged!
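A defensive variant of the merge (a sketch on toy frames, not the project's data) can use `validate` and `indicator` to catch duplicate or unmatched keys early:

```python
import pandas as pd

left = pd.DataFrame({'Province': ['PV20', 'PV21'], 'Year': [2010, 2010],
                     'lead_per_m3': [21.6, 16.6]})
right = pd.DataFrame({'Province': ['PV20', 'PV21'], 'Year': [2010, 2010],
                      'Total_incidence': [20.5, 19.2]})

# validate='one_to_one' raises if either frame has duplicate keys;
# indicator=True adds a '_merge' column recording where each row came from.
merged = left.merge(right, how='inner', on=['Province', 'Year'],
                    validate='one_to_one', indicator=True)
```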
lead_crime_df = lead_df.merge(crime_df, how='inner', on=['Province', 'Year'],
copy=True)
lead_crime_df = lead_crime_df.drop(['Vol_Wastewater', 'Lead', 'ID'], axis=1)
lead_crime_df['Province'] = lead_crime_df['Province'].map({'PV20': 'Groningen',
'PV21': 'Fryslân',
'PV22': 'Drenthe',
'PV23': 'Overijssel',
'PV24': 'Flevoland',
'PV25': 'Gelderland',
'PV26': 'Utrecht',
'PV27': 'Noord-Holland',
'PV28': 'Zuid-Holland',
'PV29': 'Zeeland',
'PV30': 'Noord-Brabant',
'PV31': 'Limburg'})
lead_crime_df.rename(columns={'CRI1110':'Violent theft and burglary',
'CRI1500':'Extortion',
'CRI2100':'Vandalism',
'CRI2210':'Public violence',
'CRI2300':'Arson / Explosion',
'CRI3000':'Violent an sexual crimes',
'CRI7000':'Firearms crimes'},
inplace=True)
lead_crime_df
| | Province | Year | lead_per_m3 | Violent theft and burglary | Extortion | Vandalism | Public violence | Arson / Explosion | Violent an sexual crimes | Firearms crimes | Total_incidence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Groningen | 2010 | 21.597321 | 1.0 | 0.0 | 9.5 | 0.5 | 0.6 | 8.6 | 0.3 | 20.5 |
| 1 | Groningen | 2011 | 16.644280 | 0.7 | 0.0 | 9.1 | 0.4 | 0.6 | 8.0 | 0.4 | 19.2 |
| 2 | Groningen | 2012 | 20.486432 | 0.6 | 0.0 | 8.5 | 0.5 | 0.5 | 7.7 | 0.3 | 18.1 |
| 3 | Groningen | 2013 | 15.706495 | 0.5 | 0.0 | 7.4 | 0.3 | 0.3 | 7.2 | 0.3 | 16.0 |
| 4 | Groningen | 2014 | 14.376602 | 0.4 | 0.0 | 6.6 | 0.2 | 0.3 | 6.6 | 0.2 | 14.3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 127 | Limburg | 2016 | 20.952597 | 0.4 | 0.1 | 5.6 | 0.2 | 0.4 | 5.0 | 0.3 | 12.0 |
| 128 | Limburg | 2017 | 27.772721 | 0.3 | 0.0 | 4.9 | 0.2 | 0.4 | 4.7 | 0.3 | 10.8 |
| 129 | Limburg | 2018 | 37.277123 | 0.3 | 0.0 | 4.3 | 0.2 | 0.3 | 4.6 | 0.4 | 10.1 |
| 130 | Limburg | 2019 | 23.179937 | 0.4 | 0.1 | 4.8 | 0.2 | 0.3 | 4.6 | 0.5 | 10.9 |
| 131 | Limburg | 2020 | 11.842070 | 0.3 | 0.1 | 4.6 | 0.1 | 0.3 | 4.1 | 0.5 | 10.0 |
132 rows × 11 columns
Above is the dataframe I'll be working with to answer my research question. Below I'll provide some information on its columns:
print("CRIME DATA")
ds.DS_Q_Q_Plot(crime_df['Total_incidence'])
ds.DS_Q_Q_Hist(crime_df['Total_incidence'])
CRIME DATA Estimation method: robust n = 132, mu = 12.8, sigma = 3.855 Expected number of data outside CI: 7
Estimation method: robust mu = 12.8, sigma = 3.855
print("LEAD DATA")
ds.DS_Q_Q_Plot(lead_df['lead_per_m3'])
ds.DS_Q_Q_Hist(lead_df['lead_per_m3'])
LEAD DATA Estimation method: robust n = 132, mu = 17.53, sigma = 4.821 Expected number of data outside CI: 7
Estimation method: robust mu = 17.53, sigma = 4.821
Looking at the histogram of the total crime incidence values, it's clear this data is not normally distributed. I'll take this into account when calculating a correlation between lead and crime.
To see how lead and total incidence behave together, I will plot them together on a line plot over time. To account for the difference in y-scales I'll first normalize the data.
normalized_lead_crime = lead_crime_df[['Province','Year',
'lead_per_m3','Total_incidence']].copy()
for col in ['lead_per_m3', 'Total_incidence']:
    df_max = normalized_lead_crime[col].max()
    df_min = normalized_lead_crime[col].min()
    df_range = df_max - df_min
    normalized_lead_crime[col] = normalized_lead_crime[col].apply(
        lambda x: (x - df_min) / df_range)
normalized_lead_crime
| | Province | Year | lead_per_m3 | Total_incidence |
|---|---|---|---|---|
| 0 | Groningen | 2010 | 0.517223 | 0.976378 |
| 1 | Groningen | 2011 | 0.364721 | 0.874016 |
| 2 | Groningen | 2012 | 0.483019 | 0.787402 |
| 3 | Groningen | 2013 | 0.335846 | 0.622047 |
| 4 | Groningen | 2014 | 0.294899 | 0.488189 |
| ... | ... | ... | ... | ... |
| 127 | Limburg | 2016 | 0.497372 | 0.307087 |
| 128 | Limburg | 2017 | 0.707362 | 0.212598 |
| 129 | Limburg | 2018 | 1.000000 | 0.157480 |
| 130 | Limburg | 2019 | 0.565952 | 0.220472 |
| 131 | Limburg | 2020 | 0.216862 | 0.149606 |
132 rows × 4 columns
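The same min-max scaling can also be written without `apply` (a vectorised sketch on toy numbers):

```python
import pandas as pd

df = pd.DataFrame({'lead_per_m3': [10.0, 20.0, 30.0],
                   'Total_incidence': [5.0, 15.0, 25.0]})

cols = ['lead_per_m3', 'Total_incidence']
# (x - min) / (max - min), applied column-wise in one expression.
df[cols] = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())
```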
normalized_lead_crime.hvplot(kind='line', x='Year',
y=['lead_per_m3','Total_incidence'],
groupby='Province').opts(legend_position='top_right')
Two observations here:
"""
Out of curiousity, I am going to have a look at the incidence of different
crimes per province, over time. I've set a fixed y-axis so the decrese of
incidence can be visually observed.
"""
Crimes = ['Violent theft and burglary','Extortion', 'Vandalism',
'Public violence', 'Arson / Explosion', 'Violent an sexual crimes',
'Firearms crimes']
hv.output(widget_location='top_left')
lead_crime_df.set_index('Province'
).hvplot(kind='bar', stacked=True,
y=Crimes, groupby='Year',
ylim=(0,25), title='Crimes per region, per year.',
ylabel='Incidence per 100.000 inhabitants',
rot=40)
To answer my research question - is there a correlation between lead in water and violent crimes? - I'm going to perform a Spearman correlation test. I selected the Spearman test over other correlation tests because it does not assume normally distributed data
from scipy.stats import spearmanr
correlation, p_val = spearmanr(lead_crime_df['lead_per_m3'],
lead_crime_df['Total_incidence'])
print(f"""
The correlation between the lead per cubic meter of water and
total incidence of violent crimes is {correlation:.2f}, p value: {p_val}.
""")
The correlation between the lead per cubic meter of water and total incidence of violent crimes is 0.23, p value: 0.006989339126132411.
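A quick illustration of why Spearman suits this data (toy numbers, not the project data): it correlates ranks, so a monotone but non-linear relation still scores a perfect 1, where Pearson does not:

```python
from scipy.stats import spearmanr, pearsonr

# y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # x cubed

rho, _ = spearmanr(x, y)  # rank-based: perfect monotone relation -> 1.0
r, _ = pearsonr(x, y)     # linear measure: high, but below 1.0
```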
scatter_data = hv.Scatter(data=lead_crime_df, kdims='lead_per_m3',
vdims='Total_incidence', label='Regression line')
scatter = lead_crime_df.hvplot.scatter(
x='lead_per_m3',
y='Total_incidence',
xlabel='Lead per m3 of water (in grams)',
ylabel='Total incidence per 100.000 inhabitants',
label='Data',
title='Lead in waste water over incidence of violent crime')
Regression_line = hv.Slope.from_scatter(scatter_data).opts(color='black')
print("The black line is the regression line")
scatter.opts(legend_position='top_right') * Regression_line
The black line is the regression line
# In the above graph we see four datapoints that seem to be a bit out of the ordinary.
# I'm going to drop them to see if this has any impact.
to_drop = lead_crime_df[(lead_crime_df['lead_per_m3'] < 6) &
(lead_crime_df['Total_incidence'] > 18)
].index.tolist()
to_drop.extend(lead_crime_df[(lead_crime_df['lead_per_m3'] > 35) &
(lead_crime_df['Total_incidence'] < 14)
].index.tolist())
to_drop
lead_crime_df.drop(to_drop, inplace=True)
correlation, p_val = spearmanr(lead_crime_df['lead_per_m3'],
lead_crime_df['Total_incidence'])
print(f"""
The correlation between the lead per cubic meter of water and
total incidence of violent crimes is {correlation:.2f}, p value: {p_val}.
""")
The correlation between the lead per cubic meter of water and total incidence of violent crimes is 0.30, p value: 0.0005395582436530309.
Removing these four datapoints does impact the correlation somewhat (from 0.23 to 0.30). More interestingly, it decreases the p value further, keeping it well below 0.05.
scatter_data = hv.Scatter(data=lead_crime_df, kdims='lead_per_m3',
vdims='Total_incidence')
scatter = lead_crime_df.hvplot.scatter(
x='lead_per_m3',
y='Total_incidence',
xlabel='Lead per m3 of water (in grams)',
ylabel='Total incidence per 100.000 inhabitants',
label='Data',
title='Lead in waste water over incidence of violent crime, outliers excluded')
Regression_line2 = hv.Slope.from_scatter(scatter_data).opts(color='red')
print("The black line is the regression line with the outlires included")
print("The red line is the regression line with the outlires excluded")
(scatter * Regression_line
* Regression_line2).opts(legend_position='top_right')
The black line is the regression line with the outliers included The red line is the regression line with the outliers excluded
# This is confirmed visually.
So, looking at the total incidence against lead per m3 of water for the whole country, there is only a weak correlation. Next I'm going to look at the data per province, to see if there are trends visible on a smaller scale
# groupby().corr() yields a 2x2 correlation matrix per province (two rows
# after reset_index); [::2] keeps the lead_per_m3 row, whose
# Total_incidence entry is the lead-crime correlation.
corr_lead_crime_province = lead_crime_df[['lead_per_m3', 'Total_incidence', 'Province']
                                        ].groupby('Province'
                                        ).corr(method='spearman').reset_index(
                                        )[::2][['Province', 'Total_incidence']]
corr_lead_crime_province.rename(columns={'Total_incidence':'Correlation lead & crime'},
inplace=True)
corr_lead_crime_province.set_index('Province').hvplot.bar(rot=40,
xlabel='Province of the Netherlands')
# Stratify per crime and / or Province
scatter_data = hv.Scatter(data=lead_crime_df, kdims='lead_per_m3',
vdims='Total_incidence', label='Regression line')
scatter = lead_crime_df.hvplot.scatter(
x='lead_per_m3',
y='Total_incidence',
groupby='Province',
xlabel='Lead per m3 of water (in grams)',
ylabel='Total incidence per 100.000 inhabitants',
label='Data',
title='Lead in waste water over incidence of violent crime per Province')
Regression_line2 = hv.Slope.from_scatter(scatter_data).opts(color='red')
(scatter * Regression_line * Regression_line2 ).opts(legend_position='top_right')
Finally, I want to plot the incidence and lead per m3 of wastewater per Province, over a map of the Netherlands.
# Geoplotting
def get_map():
    m = folium.Map(location=[52.2594406805025, 4.952074743739346],
                   zoom_start=7)
    url = 'https://www.webuildinternet.com/articles/2015-07-19-geojson-data-of-the-netherlands/provinces.geojson'
    borderstyle = {
        'color': 'black',
        'fill': False,
        'weight': 1
    }
    border = folium.GeoJson(data=url, name='borders',
                            style_function=lambda x: borderstyle,
                            control=False).add_to(m)
    incidence = folium.Choropleth(
        geo_data=url,
        name="Incidence",
        data=lead_crime_df.replace('Fryslân', 'Friesland (Fryslân)'),
        columns=["Province", "Total_incidence"],
        key_on="properties.name",  # Thanks to Job for helping me figure out which value to use here!
        fill_color="YlOrRd",
        fill_opacity=0.7,
        line_opacity=1,
        legend_name="Total violent crime incidence (per 100.000 inhabitants)",
        show=False
    ).add_to(m)
    lead = folium.Choropleth(
        geo_data=url,
        name="Lead",
        data=lead_crime_df.replace('Fryslân', 'Friesland (Fryslân)'),
        columns=["Province", "lead_per_m3"],
        key_on="properties.name",
        fill_color="BuPu",
        fill_opacity=0.7,
        line_opacity=1,
        legend_name="Lead per m3 of wastewater (in grams)",
        show=False,
        highlight=True
    ).add_to(m)
    minimap = plugins.MiniMap(toggle_display=True)
    m.add_child(minimap)
    folium.LayerControl(collapsed=False).add_to(m)
    return m
mapobject = get_map()
mapobject.save('Lead & Crime incidence NL.html')
mapobject
Conclusion: Looking at the Netherlands as a whole, there is a weak but statistically significant correlation between the lead found in wastewater and violent crime. Per province, however, the correlation between lead in influent wastewater and violent crime incidence varies greatly, from -0.445 to 0.745, so looking at all the data I suspect the weak overall correlation does not reflect a causal relation.